DATASOC: Richard, Bianca, Jason.
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
We will be using a dataset of the top songs on Spotify by year, along with variables such as genre, bpm and song length.
Taken from Kaggle: https://www.kaggle.com/leonardopena/top-spotify-songs-from-20102019-by-year
data = pd.read_csv("top10s.csv", encoding = "ISO-8859-1")
data
data.rename(columns = {"top genre" : "genre", "pop" : "popularity"},
inplace = True)
data.head()
A bar chart is a convenient way to compare numeric values across several groups.
Compare the popularity of the top 10 artists by comparing how many top songs they have across the years.
value_counts() -> determines the frequency of each artist and sorts it in descending order.
head(10) -> takes the top 10 artists.
top_10 = data.artist.value_counts().head(10)
top_10
Obtain the top 10 artists (as strings) and put them in a list.
top_10_artists = top_10.index.tolist()
top_10_artists
Obtain the frequency of the top 10 artists and put it in a list.
top_10_artists_freq = top_10.tolist()
top_10_artists_freq
x = range(10)
plt.bar(x, top_10_artists_freq)
plt.xticks(x, top_10_artists, rotation = "vertical")
plt.title("Top 10 Artists with the most popular songs from 2010 to 2019")
plt.ylabel("Number of top songs")
plt.show()
A histogram is a plot of the frequencies or relative frequencies of values within different intervals or 'bins' that cover the range of all observed values in the sample.
Use it when we want a graphical summary of quantitative data.
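To make the idea of bins concrete, here is a minimal sketch that counts values per bin with np.histogram, which does the same kind of binning that plt.hist performs before drawing. The bpm values and the choice of 4 bins are made up for illustration and are not taken from the dataset.

```python
import numpy as np

# Hypothetical bpm values, invented for this illustration only
sample_bpm = [95, 100, 104, 110, 118, 120, 120, 125, 128, 130]

# Split the range 90-130 into 4 equal-width bins and count the values in each
counts, bin_edges = np.histogram(sample_bpm, bins=4, range=(90, 130))

# Relative frequency = count in a bin / total number of values
relative_freq = counts / counts.sum()

print("Bin edges:", bin_edges)
print("Frequencies:", counts)
print("Relative frequencies:", relative_freq)
```

Each bin includes its left edge and excludes its right edge, except the last bin, which includes both.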
Compare the bpm of dance pop songs in 2010 and 2018
dance_pop = data[data.genre == "dance pop"]
dance_pop.head()
dance_pop_2010 = dance_pop[dance_pop.year == 2010]
dance_pop_2010.head()
dance_pop_2018 = dance_pop[dance_pop.year == 2018]
dance_pop_2018.head()
print("Number of top dance pop artists in 2010:",
len(set(dance_pop_2010.artist)))
print("Number of top dance pop artists in 2018:",
len(set(dance_pop_2018.artist)))
print("Mean BPM in 2010:", dance_pop_2010.bpm.mean())
print("Median BPM in 2010:", dance_pop_2010.bpm.median())
print("Mean BPM in 2018:", dance_pop_2018.bpm.mean())
print("Median BPM in 2018:", dance_pop_2018.bpm.median())
plt.hist(dance_pop_2010.bpm, 20, edgecolor = "black")
plt.show()
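plt.subplot(nrows, ncols, index) -> selects the subplot at position index (counting from 1) in a grid with nrows rows and ncols columns.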
plt.subplot(2, 1, 1)
plt.title("Distribution of BPM in Dance Pop")
plt.hist(dance_pop_2010.bpm, 20, edgecolor = "black")
plt.ylabel("2010")
plt.subplot(2, 1, 2)
plt.hist(dance_pop_2018.bpm, 20, edgecolor = "black")
plt.ylabel("2018")
plt.show()
Looking at the two graphs, we notice that their x-axis ranges differ, which makes them difficult to compare.
We need to manually specify the same range for both.
plt.subplot(2, 1, 1)
plt.title("Distribution of BPM in Dance Pop")
plt.hist(dance_pop_2010.bpm, 20, range = (40, 180), edgecolor = "black")
plt.ylabel("2010")
plt.subplot(2, 1, 2)
plt.hist(dance_pop_2018.bpm, 20, range = (40, 180), edgecolor = "black")
plt.ylabel("2018")
plt.show()
A scatter plot is a good way to visualise how two quantitative variables are related in the data.
Find the relationship between how energetic a song is and how loud it is.
data_2019 = data[data.year == 2019]
data_2019.head()
plt.scatter(data_2019.nrgy, data_2019.dB, 10)
plt.title("Energy and Loudness of Songs in 2019")
plt.xlabel("Energy")
plt.ylabel("Decibels")
plt.show()
data_2019.nrgy.corr(data_2019.dB)
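Series.corr() computes the Pearson correlation coefficient by default. As a rough sketch of what that number measures, the example below compares the pandas result with a hand computation. The energy and loudness values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical energy and loudness values, invented for illustration
energy = pd.Series([40, 55, 60, 72, 85, 90])
loudness = pd.Series([-9, -7, -6, -5, -4, -3])

# pandas: Pearson correlation (the default method of Series.corr)
r_pandas = energy.corr(loudness)

# The same quantity by hand: sum of products of deviations from the mean,
# divided by the square root of the product of the summed squared deviations
dx = energy - energy.mean()
dy = loudness - loudness.mean()
r_manual = (dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum())

print(r_pandas, r_manual)
```

A value close to 1 means the two variables tend to increase together, a value close to -1 means one tends to decrease as the other increases, and a value near 0 means there is little linear relationship.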
set() -> obtain all the years available
sorted() -> correct order
years_sort = sorted(set(data.year))
years_sort
for curr_year in years_sort:
    data_year = data[data.year == curr_year]
    plt.scatter(data_year.nrgy, data_year.dB, 10)
    plt.title(curr_year)
    plt.xlabel("Energy")
    plt.ylabel("Decibels")
    plt.show()
A Box Plot is the visual representation of the statistical five number summary of a given data set.
A Five Number Summary includes: the minimum, the first quartile (Q1), the median, the third quartile (Q3) and the maximum.
Box plots help visualize the distribution of quantitative values in a field. They are also valuable for comparisons across different categorical variables or for identifying outliers, if any exist in a dataset.
Note: different software and libraries such as Microsoft Excel, Seaborn and others may place the end whiskers and show outliers differently on box plots. Make sure you understand your software's implementation when you need to interpret results.
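As a sketch of where the box and whiskers come from, the code below computes the quartiles and the whisker ends for one list of scores. It assumes Matplotlib's default whisker setting (whis=1.5), where each whisker reaches the furthest data point within 1.5 times the interquartile range of the box. Other tools may use a different rule. The variable names are chosen for this illustration.

```python
import numpy as np

# The same scores as value1 in the simple demo
scores = [82, 76, 24, 40, 67, 62, 75, 78, 71, 32,
          98, 89, 78, 67, 72, 82, 87, 66, 56, 52]

# Quartiles: the edges of the box and the line inside it
q1, median, q3 = np.percentile(scores, [25, 50, 75])
iqr = q3 - q1  # interquartile range = height of the box

# Fences: points beyond these are treated as outliers (assumes whis=1.5)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme data points that are still inside the fences
inside = [s for s in scores if lower_fence <= s <= upper_fence]
outliers = sorted(s for s in scores if s < lower_fence or s > upper_fence)
whisker_low, whisker_high = min(inside), max(inside)

print("Min, Q1, median, Q3, max:", min(scores), q1, median, q3, max(scores))
print("Whiskers:", whisker_low, whisker_high)
print("Outliers:", outliers)
```

Note that when outliers exist, the whisker ends are not the minimum and maximum of the data, which is one reason box plots from different tools can look different.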
# simple demo
value1 = [82,76,24,40,67,62,75,78,71,32,98,89,78,67,72,82,87,66,56,52]
value2=[62,5,91,25,36,32,96,95,3,90,95,32,27,55,100,15,71,11,37,21]
value3=[23,89,12,78,72,89,25,69,68,86,19,49,15,16,16,75,65,31,25,52]
value4=[59,73,70,16,81,61,88,98,10,87,29,72,16,23,72,88,78,99,75,30]
box_plot_data=[value1,value2,value3,value4]
plt.boxplot(box_plot_data,patch_artist=True,labels=['course1','course2','course3','course4'])
plt.show()
The boxplot() function takes the data to be plotted as its first argument. The argument patch_artist=True fills the boxes with color, and labels sets the label shown for each box.
box=plt.boxplot(box_plot_data,vert=0,patch_artist=True,labels=['course1','course2','course3','course4'])
colors = ['cyan', 'lightblue', 'lightgreen', 'tan']
# zip pairs each box with its color
for patch, color in zip(box['boxes'], colors):
    patch.set_facecolor(color)
plt.show()
Passing vert=0 to boxplot() draws the box plot horizontally. The colors list holds four different colors, which are applied to the four boxes with patch.set_facecolor().
# ex1
data[data['genre']=='pop']['artist'].value_counts().plot.pie(figsize=(10,10),autopct='%1.1f%%')
plt.title('Plotting of Pop song based on artist in percentage')
plt.show()
# autopct displays each slice's percentage, formatted with a Python format string
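To see what the format string '%1.1f%%' produces, here is a small sketch that applies it by hand to each slice's share of the total. The sizes are made up for illustration, and the sketch only mimics the labelling step rather than calling Matplotlib.

```python
# Hypothetical slice sizes, invented for this illustration
sizes = [3, 1, 4]
total = sum(sizes)

# %1.1f -> one digit after the decimal point, %% -> a literal percent sign
labels_pct = ['%1.1f%%' % (100 * s / total) for s in sizes]

print(labels_pct)
```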
# ex2 - with explode / your choice of colour
# Data to plot
labels = 'Python', 'C++', 'Ruby', 'Java'
sizes = [215, 130, 245, 210]
colors = ['gold', 'yellowgreen', 'lightcoral', 'lightskyblue']
explode = (0.1, 0, 0, 0) # explode 1st slice
# Plot
plt.pie(sizes, explode=explode, labels=labels, colors=colors,
autopct='%1.1f%%', shadow=True, startangle=140)
plt.axis('equal')
plt.show()
Line graphs are usually used to show time series data, that is, how one or more variables vary over a continuous period of time.
Line graphs are used to track changes over short and long periods of time. When the changes are small, line graphs show them better than bar graphs.
Line graphs can also be used to compare changes over the same period of time for more than one group.
import pandas as pd
import matplotlib.pyplot as plt
Data = {'Year': [1920,1930,1940,1950,1960,1970,1980,1990,2000,2010],
'Unemployment_Rate': [9.8,12,8,7.2,6.9,7,6.5,6.2,5.5,6.3]
}
df = pd.DataFrame(Data,columns=['Year','Unemployment_Rate'])
df
plt.plot(df['Year'], df['Unemployment_Rate'], color='red', marker='o')
plt.title('Unemployment Rate Vs Year', fontsize=14)
plt.xlabel('Year', fontsize=14)
plt.ylabel('Unemployment Rate', fontsize=14)
# plt.grid(True)
plt.show()
# number of top songs for each of the top 10 artists
data['artist'].value_counts().head(10).plot.line(figsize=(20,10))
plt.xlabel('Artist Name')
plt.ylabel('Number of songs')
plt.title('Top 10 artists')
plt.show()
A heat map visually represents data with color.
How popular are the songs of the top 10 artists? How important are breakout hits?
num = 10
top_artists = []
top_song_labels = [f"#{i} Song" for i in range(1, 1 + num)]
top_song_popularities = np.empty((0, num))
# items() replaces iteritems(), which was removed in pandas 2.0
for artist, count in data.artist.value_counts().head(num).items():
    top_artists.append(artist)
    artist_top = data[data.artist == artist].sort_values("popularity", ascending=False).popularity.head(num).tolist()
    # Pad with zeros when the artist has fewer than num songs
    artist_top += [0] * (num - len(artist_top))
    top_song_popularities = np.concatenate((top_song_popularities, [artist_top]), axis=0)
fig, (ax1, ax2) = plt.subplots(2,figsize=(20,20))
im = ax1.imshow(top_song_popularities)
# We want to show all ticks...
ax1.axis('tight')
ax1.set(xticks=np.arange(len(top_song_labels)), xticklabels=top_song_labels,
yticks=np.arange(len(top_artists)), yticklabels=top_artists)
# Rotate the tick labels and set their alignment.
plt.setp(ax1.get_xticklabels(), rotation=45, ha="right",
rotation_mode="anchor")
# Loop over data dimensions and create text annotations.
for i in range(len(top_artists)):
    for j in range(len(top_song_labels)):
        text = ax1.text(j, i, top_song_popularities[i, j], ha="center", va="center", color="w")
ax1.set_title("Popularity of Top 10 songs for the best artists.")
fig.tight_layout()
# Showing Bar Graph to accompany Color Map
x = range(10)
plt.bar(x, top_10_artists_freq)
plt.xticks(x, top_10_artists, rotation = "vertical")
plt.title("Top 10 Artists with the most popular songs from 2010 to 2019")
plt.ylabel("Number of top songs")
plt.show()
A pair plot shows the joint distribution of each pair of variables as a scatterplot.
The marginal distribution of each variable is plotted along the diagonal as a histogram.
Useful for looking at correlation and relationships across our variables.
import seaborn as sns
sns.set_style("ticks")
sns.pairplot(data)
plt.show()
We can change how the distributions on the diagonal are drawn.
sns.pairplot(data, diag_kind = "kde") # kernel density estimate
plt.show()
We can choose specific variables for the pairplot.
sns.pairplot(data, vars = ["popularity", "bpm"])
plt.show()
def mass_plot(data, x_var, y_var, plot_type):
    # Draw one plot of y_var against x_var for each year in the data
    years_sort = sorted(set(data.year))
    for curr_year in years_sort:
        data_year = data[data.year == curr_year]
        if plot_type == "bar":
            plt.bar(data_year[x_var], data_year[y_var], 10)  # 10 = bar width
        elif plot_type == "scatter":
            plt.scatter(data_year[x_var], data_year[y_var], 10)  # 10 = marker size
        elif plot_type == "line":
            # plt.plot has no positional size argument, so the stray 10 is dropped
            plt.plot(data_year[x_var], data_year[y_var])
        plt.title(curr_year)
        plt.xlabel(x_var)
        plt.ylabel(y_var)
        plt.show()
plot_type = "line"
x_var = "bpm"
y_var = "popularity"
mass_plot(data, x_var, y_var, plot_type)
plot_type = "scatter"
x_var = "nrgy"
y_var = "dB"
mass_plot(data, x_var, y_var, plot_type)
We can export our graphs as images with plt.savefig()
for curr_year in years_sort:
    data_year = data[data.year == curr_year]
    plt.scatter(data_year.nrgy, data_year.dB, 10)
    plt.title(curr_year)
    plt.xlabel("Energy")
    plt.ylabel("Decibels")
    plt.savefig(str(curr_year), dpi = 200) # dpi = dots per inch
    plt.clf() # clears the current plot
This is the code for the scatterplot of the relationship between the energy and loudness of a song, which we did before. With any plot, we can export it as an image by replacing plt.show() with plt.savefig().
This specific code will export each scatterplot from 2010 to 2019 as an image.
dpi = dots per inch. A higher value gives a higher-resolution image.
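The file extension passed to savefig() selects the output format. The sketch below saves one small figure as both a PNG and a PDF. The data points, the file names and the use of a temporary folder are assumptions made for this illustration, and the "Agg" backend is selected only so that the sketch runs without a display.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, assumed here so no window is needed
import matplotlib.pyplot as plt

# Hypothetical points, invented for illustration
fig, ax = plt.subplots()
ax.scatter([40, 60, 80], [-8, -6, -4], 10)
ax.set_title("Example")
ax.set_xlabel("Energy")
ax.set_ylabel("Decibels")

# Save into a temporary folder so no existing files are overwritten
out_dir = tempfile.mkdtemp()
png_path = os.path.join(out_dir, "example.png")
pdf_path = os.path.join(out_dir, "example.pdf")

fig.savefig(png_path, dpi=200)  # raster image, resolution controlled by dpi
fig.savefig(pdf_path)           # vector format, chosen by the .pdf extension
plt.close(fig)                  # free the figure once it has been saved

print(os.path.exists(png_path), os.path.exists(pdf_path))
```

When the file name has no extension, as in the loop above, Matplotlib falls back to its default format, which is PNG unless it has been configured otherwise.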